feat: nms op #4246
Conversation
Commits: cubecl impl, block-parallel, bitmask-parallel two phase, optimize for cpu, formatting, simplify, response, remove kernel, refactor tensor op for ergonomics, linting, tests
Codecov Report
❌ Your project check has failed because the head coverage (68.88%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.
Additional details and impacted files:
@@ Coverage Diff @@
## main #4246 +/- ##
==========================================
+ Coverage 68.85% 68.88% +0.03%
==========================================
Files 1405 1407 +2
Lines 167686 167918 +232
==========================================
+ Hits 115456 115669 +213
- Misses 52230 52249 +19
☔ View full report in Codecov by Sentry.
For context: I'm not saying it is, just saying it may still be useful to have the GPU-only NMS, as I have used that myself. I believe having both versions available is ideal here, so users can choose which one fits their use case best.
The data was on the device being tested at the start, no transfers included. I didn't test the time to transfer the NMS input to CPU and back, though. I can include the kernel for completeness. I only tested a specific scenario on Apple Silicon, so maybe it's worth it in other cases. I'm sure the kernel has room for optimization as well.
One additional factor to consider here is that on Apple Silicon the data transfers are much faster because of unified memory. On Linux/CUDA the transfer itself might have a much bigger overhead because it's a separate device. In that case, even if the NMS algorithm itself is potentially slower on GPU, it may still be beneficial to run the slower algorithm on device and avoid the data transfer, rather than running a faster algorithm on CPU but paying for the transfer.
Great job! Thanks for the accelerated NMS implementation.
Just some minor comments below.
For a GPU implementation, I think the benefits typically come with a batched NMS implementation, which involves some tricks. Most commonly, the box coordinates are offset per class so NMS is applied to each class independently (ref: torchvision batched_nms).
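To make the offset trick concrete, here is a minimal CPU-side sketch of it on plain slices, assuming a single-class NMS callback over [x1, y1, x2, y2] boxes; all names are illustrative and not taken from this PR:
// Illustrative sketch of the class-offset trick used by batched NMS
// (e.g. torchvision's batched_nms); not code from this PR.
fn batched_nms(
    boxes: &[[f32; 4]],
    scores: &[f32],
    class_ids: &[u32],
    iou_threshold: f32,
    nms_single_class: impl Fn(&[[f32; 4]], &[f32], f32) -> Vec<usize>,
) -> Vec<usize> {
    // Offset each box by class_id * (max_coord + 1) so boxes of different
    // classes can never overlap; one NMS pass then acts per class.
    let max_coord = boxes
        .iter()
        .flat_map(|b| b.iter().copied())
        .fold(0.0f32, f32::max);
    let offset_boxes: Vec<[f32; 4]> = boxes
        .iter()
        .zip(class_ids)
        .map(|(b, &c)| {
            let off = c as f32 * (max_coord + 1.0);
            [b[0] + off, b[1] + off, b[2] + off, b[3] + off]
        })
        .collect();
    nms_single_class(&offset_boxes, scores, iou_threshold)
}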
// Filter by score threshold and convert to SoA format in one pass
let score_thresh = options.score_threshold;
let mut x1s = Vec::with_capacity(n_boxes);
let mut y1s = Vec::with_capacity(n_boxes);
let mut x2s = Vec::with_capacity(n_boxes);
let mut y2s = Vec::with_capacity(n_boxes);
let mut areas = Vec::with_capacity(n_boxes);
let mut filtered_scores = Vec::with_capacity(n_boxes);
let mut original_indices = Vec::with_capacity(n_boxes);
Instead of allocating 5 separate Vec<f32>s, it would likely be better to use a single contiguous "scratchpad" for the filtered box data, reducing this to a single heap allocation:
// After filtering and sorting, we know the number of filtered boxes `n_filtered`
let mut scratchpad = vec![0.0f32; n_filtered * 5];
// Split into slices
let (x1s, rest) = scratchpad.split_at_mut(n_filtered);
let (y1s, rest) = rest.split_at_mut(n_filtered);
let (x2s, rest) = rest.split_at_mut(n_filtered);
let (y2s, areas) = rest.split_at_mut(n_filtered);
for (i, b) in filtered.iter().enumerate() {
    // Fill the slices with the filtered box coordinates and areas
}

And the previous filtering can simply be applied to the indices instead:
let mut filtered_indices = Vec::with_capacity(n_boxes);
for i in 0..n_boxes {
    let score = scores_vec[i];
    if score >= options.score_threshold {
        filtered_indices.push(i); // original index
    }
}
let n_filtered = filtered_indices.len();
if n_filtered == 0 {
    return vec![];
}
// Note: I used total_cmp instead (which should handle NaNs instead of panicking)
filtered_indices.sort_by(|&a, &b| scores_vec[b].total_cmp(&scores_vec[a]));

This also makes it easier to implement aligned loads in the future if desired. Even without that, the single contiguous allocation should already be an improvement I believe.
You could benchmark both implementations to see if it's actually worth it.
It made for a nice improvement!
Unaligned:
| Benchmark | Burn Version | Shapes | Feature | Backend | Device | Median |
|---|---|---|---|---|---|---|
| nms-n100-f32 | local | [(100, 4)(100)] | ndarray | ndarray | Cpu | 17.000µs |
| nms-n1000-f32 | local | [(1000, 4)(1000)] | ndarray | ndarray | Cpu | 438.000µs |
| nms-n10000-f32 | local | [(10000, 4)(10000)] | ndarray | ndarray | Cpu | 21.385ms |
| nms-n5000-f32 | local | [(5000, 4)(5000)] | ndarray | ndarray | Cpu | 6.785ms |
Aligned:
| Benchmark | Burn Version | Shapes | Feature | Backend | Device | Median |
|---|---|---|---|---|---|---|
| nms-n100-f32 | local | [(100, 4)(100)] | ndarray | ndarray | Cpu | 16.000µs |
| nms-n1000-f32 | local | [(1000, 4)(1000)] | ndarray | ndarray | Cpu | 373.000µs |
| nms-n10000-f32 | local | [(10000, 4)(10000)] | ndarray | ndarray | Cpu | 17.337ms |
| nms-n5000-f32 | local | [(5000, 4)(5000)] | ndarray | ndarray | Cpu | 5.460ms |
With all_suppressed optimization:
| Benchmark | Burn Version | Shapes | Feature | Backend | Device | Median |
|---|---|---|---|---|---|---|
| nms-n100-f32 | local | [(100, 4)(100)] | ndarray | ndarray | Cpu | 15.000µs |
| nms-n1000-f32 | local | [(1000, 4)(1000)] | ndarray | ndarray | Cpu | 276.000µs |
| nms-n10000-f32 | local | [(10000, 4)(10000)] | ndarray | ndarray | Cpu | 11.933ms |
| nms-n5000-f32 | local | [(5000, 4)(5000)] | ndarray | ndarray | Cpu | 3.692ms |
Thanks for reporting the benchmarks! That's awesome.
Assuming the "unaligned" results are from the previous version, I think the biggest jump in "aligned" likely comes from the single vec allocation (for reads). But the aligned vec is a nice addition on top.
/// Non-Maximum Suppression options.
#[derive(Clone, Copy, Debug)]
pub struct NmsOptions {
    /// IoU threshold for suppression (default: 0.5).
    /// Boxes with IoU > threshold with a higher-scoring box are suppressed.
    pub iou_threshold: f32,
    /// Score threshold to filter boxes before NMS (default: 0.0, i.e., no filtering).
    /// Boxes with score < score_threshold are discarded.
    pub score_threshold: f32,
    /// Maximum number of boxes to keep (0 = unlimited).
    pub max_output_boxes: usize,
}
Future notes: might be useful to extend box format to [cx, cy, w, h] on top of the current [x1, y1, x2, y2] format.
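If that extension lands, the conversion is cheap enough to do up front; a minimal sketch with illustrative names, not code from this PR:
// Illustrative conversion from center format [cx, cy, w, h] to the corner
// format [x1, y1, x2, y2] used by the current implementation; not PR code.
fn cxcywh_to_xyxy(b: [f32; 4]) -> [f32; 4] {
    let [cx, cy, w, h] = b;
    [cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h]
}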
Thanks! After squeezing out some more performance, I'm happy with the state of the PR, but I think it's probably worthwhile to circle back to the kernel in the future, possibly when adding batch_nms as you say.
laggui
left a comment
Thanks for addressing my comments!
LGTM, pending one small change I believe is still required.
Approving in advance nonetheless.
match lanes {
    4 => *(suppressed.as_ptr().add(i) as *const u32) == 0x01010101,
    8 => *(suppressed.as_ptr().add(i) as *const u64) == 0x0101010101010101,
    16 => {
        *(suppressed.as_ptr().add(i) as *const u128)
            == 0x01010101010101010101010101010101
    }
    _ => unreachable!(),
}
This casts a *const bool (1-byte aligned) to a *const u32/u64/u128 without guaranteeing proper alignment. In practice I think it will work on modern platforms that typically over-align (as I've discovered in the past), but it's still technically UB.
I think an easy fix would be to simply read unaligned, e.g.
suppressed.as_ptr().add(i).cast::<u32>().read_unaligned()
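For reference, a hedged sketch of what the fixed check could look like with unaligned reads, assuming `suppressed: &[bool]` and `i + lanes <= suppressed.len()` as in the snippet above:
// Same comparisons as the original, but with unaligned reads so the
// bool-to-integer pointer casts no longer violate alignment requirements.
unsafe {
    match lanes {
        4 => suppressed.as_ptr().add(i).cast::<u32>().read_unaligned() == 0x01010101,
        8 => suppressed.as_ptr().add(i).cast::<u64>().read_unaligned() == 0x0101010101010101,
        16 => {
            suppressed.as_ptr().add(i).cast::<u128>().read_unaligned()
                == 0x01010101010101010101010101010101
        }
        _ => unreachable!(),
    }
}
A fully safe alternative would be suppressed[i..i + lanes].iter().all(|&s| s), which the optimizer may reduce to a similar wide comparison; worth a quick benchmark if you want to drop the unsafe block entirely.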
Pull Request Template
Checklist
- The cargo run-checks command has been executed.
Changes
Adds an nms op to burn-vision for detection use-cases. It matches the options of the NonMaxSuppression ONNX op to make it easy to support in later PRs.
Testing
I compared outputs from torchvision and this implementation and confirm they match with the same settings.
Note: I also wrote a CubeCL kernel for GPU acceleration, but try as I might, it was just slower than the CPU SIMD implementation. The data size is not that large for most applications and part of the algorithm is inherently sequential. The best I could get was ~70% slower than CPU for 800 boxes, 16% slower for 3200, and about the same for 12800. I decided to omit it from the PR because it's just faster to do it on CPU and transfer back. Maybe there are use-cases or scenarios I'm not considering.
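For reference, the core greedy loop looks like the following minimal scalar sketch (illustrative only, not the SIMD implementation in this PR); it shows why part of the algorithm is inherently sequential: whether a box survives depends on every higher-scoring box kept before it.
// Minimal scalar NMS sketch (illustrative, not the PR's SIMD code).
// `boxes` are [x1, y1, x2, y2] rows; `order` holds indices sorted by
// descending score; returns indices of kept boxes.
fn nms_reference(boxes: &[[f32; 4]], order: &[usize], iou_threshold: f32) -> Vec<usize> {
    let area = |b: &[f32; 4]| (b[2] - b[0]).max(0.0) * (b[3] - b[1]).max(0.0);
    let mut suppressed = vec![false; boxes.len()];
    let mut keep = Vec::new();
    for (pos, &i) in order.iter().enumerate() {
        if suppressed[i] {
            continue;
        }
        keep.push(i);
        // The inner loop is data-parallel (this is what the SIMD path speeds
        // up), but the outer loop must run in order.
        for &j in &order[pos + 1..] {
            if suppressed[j] {
                continue;
            }
            let (a, b) = (&boxes[i], &boxes[j]);
            let ix1 = a[0].max(b[0]);
            let iy1 = a[1].max(b[1]);
            let ix2 = a[2].min(b[2]);
            let iy2 = a[3].min(b[3]);
            let inter = (ix2 - ix1).max(0.0) * (iy2 - iy1).max(0.0);
            let union = area(a) + area(b) - inter;
            if union > 0.0 && inter / union > iou_threshold {
                suppressed[j] = true;
            }
        }
    }
    keep
}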